In the early 2000s, the New York City Police Department (NYPD) was required to begin publicly reporting data on its stop-and-frisk program. The program allows officers to temporarily stop, question, and potentially frisk or search civilians they suspect may have committed a crime, without needing evidence or a warrant. Stops may result in release, arrest, or a criminal summons. Because civilians can be stopped without evidence, we explore which variables best predict whether a stop ends in arrest. To do so, we apply classification methods to stop-level data, focusing on features observable before the stop begins.
For over 20 years, civilians in NYC have been subject to being stopped, questioned, and searched by the police based solely on suspicion, without any evidence. From 2002 to 2012, under Mayor Bloomberg's administration, stops were far more frequent, peaking in 2011 with 685,724 reported stops (NYCLU 2024). With changes in administration, the number of stops has decreased in recent years, with 15,102 and 16,791 stops in 2022 and 2023 respectively. The stop-and-frisk program has been heavily criticized because it does not require evidence to stop individuals and may allow police to use racial profiling to target individuals. For example, the New York Civil Liberties Union (NYCLU) has criticized the program after its investigation of the data found that a disproportionate number of stopped civilians are people of color, in particular Black civilians, and that some stops have resulted in police misconduct. As stops do not require prior evidence or warrants, we are interested in predicting which stops result in arrest, focusing in particular on factors observable before the stop begins, such as civilian demographics, location, time of day, and attributes of the police officers.
Target variable: Our goal is to predict suspect_arrested_flag, a binary indicator equal to 1 if the individual who was stopped was arrested, and 0 otherwise.
Motivation: Because stops require neither evidence nor a warrant, it is important to understand which stops actually end in arrest. Accurate ex-ante prediction of arrest speaks both to the effectiveness of the program and to the concerns about disproportionate stops raised by the NYCLU and others.
Relevant features: We focus on information available before or at the start of the stop: suspect demographics (age, sex, race, height, weight), the time and location of the stop, census-tract demographics for the stop location, and attributes of the officers involved.
Methods: We use classification methods (logistic regression, penalized regression, and random forests) and compare their predictive accuracy on a held-out test set.
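As a minimal sketch of the classification approach, a baseline logistic regression in base R looks like the following. The data here are simulated (the real features are constructed later in the report), and the variable names are purely illustrative:

```r
# simulated stand-in for stop-level data (illustrative only)
set.seed(1)
n <- 500
sim <- data.frame(
  age   = rnorm(n),
  frisk = rbinom(n, 1, 0.4)
)
sim$arrested <- rbinom(n, 1, plogis(-1 + 0.8 * sim$frisk + 0.3 * sim$age))

# baseline classifier: logistic regression
fit <- glm(arrested ~ age + frisk, data = sim, family = binomial())

# predicted arrest probabilities, thresholded at 0.5
p_hat <- predict(fit, type = "response")
pred_class <- as.numeric(p_hat > 0.5)
mean(pred_class == sim$arrested)  # in-sample accuracy
```

The penalized and random-forest models used later follow the same fit-then-predict pattern, with regularization or ensembling replacing the plain likelihood fit.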
We use the New York Police Department Stop, Question and Frisk Data from 2023.
This is a stop-level dataset: each observation (row) corresponds to a unique stop made by an NYPD officer in 2023 as part of the SQF program.
Data collection: Each stop is documented by the officer on a standardized stop report form, so the data cover only stops that officers actually recorded. As the Precinct or Prejudice paper on SQF reporting discusses, some stops may never be recorded, meaning the observed universe likely understates total stop activity.
Temporal span: the entire year of 2023.
Spatial span: the 77 NYPD police precincts covering all 5 boroughs of NYC.
Dimension: The data has 16,971 observations and 82 features, as shown below.
We begin the project by installing and loading in the necessary libraries.
# Install and load required packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
library(pacman)
p_load(readxl, dplyr, ggplot2, knitr, lubridate, tidyr, sf, httr, caret, glmnet, stringr, remotes, RColorBrewer, viridis, scales, classInt, forcats, pROC, randomForest)
We proceed by setting up the file path and importing the data from our project’s GitHub repository.
# get raw content of the file
response <- GET("https://raw.githubusercontent.com/rrachelkane/data-science-group-project/main/data/sqf-2023.xlsx")
# retrieve the .xlsx file
if (status_code(response) == 200) {
# create a temporary file to save the downloaded content
temp_file <- tempfile(fileext = ".xlsx")
# Write the raw content to the temporary file
writeBin(content(response, "raw"), temp_file)
# Read the Excel file from the temporary file
sqf_data <- read_xlsx(temp_file)
# View the first few rows of the data
head(sqf_data)
} else {
stop("Failed to download the file.")
}
## # A tibble: 6 × 82
## STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2 DAY2 STOP_WAS_INITIATED
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 2023-01-01 00:44:00 2023 January Sund… Based on Radio Run
## 2 2 2023-01-01 00:49:00 2023 January Sund… Based on Self Ini…
## 3 3 2023-01-01 05:31:00 2023 January Sund… Based on Radio Run
## 4 4 2023-01-01 04:59:00 2023 January Sund… Based on Self Ini…
## 5 5 2023-01-01 05:21:00 2023 January Sund… Based on Self Ini…
## 6 6 2023-01-01 18:00:00 2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## # ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## # SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## # SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## # LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## # JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## # SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …
# check original dimensions
dim(sqf_data)
## [1] 16971 82
# view head
head(sqf_data)
## # A tibble: 6 × 82
## STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2 DAY2 STOP_WAS_INITIATED
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 2023-01-01 00:44:00 2023 January Sund… Based on Radio Run
## 2 2 2023-01-01 00:49:00 2023 January Sund… Based on Self Ini…
## 3 3 2023-01-01 05:31:00 2023 January Sund… Based on Radio Run
## 4 4 2023-01-01 04:59:00 2023 January Sund… Based on Self Ini…
## 5 5 2023-01-01 05:21:00 2023 January Sund… Based on Self Ini…
## 6 6 2023-01-01 18:00:00 2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## # ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## # SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## # SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## # LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## # JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## # SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …
First, we convert all column names from upper case to lower case for easier handling.
colnames(sqf_data) <- tolower(colnames(sqf_data))
# check
colnames(sqf_data)[1:3]
## [1] "stop_id" "stop_frisk_date" "stop_frisk_time"
Next, we drop all columns that cannot be used for our prediction question, either because they are realized after the outcome of interest, namely suspect_arrested_flag, or because they are irrelevant for other reasons. We drop all spatial features other than the x and y coordinates of the stop, as these are the most granular spatial information available, and we later use them to map each stop to a census tract to obtain demographic information for the stop location. The dropped spatial features also have high cardinality, which would add many dummies to our model and undermine computational efficiency.
sqf_data <- sqf_data %>%
select(- c("stop_frisk_date", "record_status_code", "supervising_action_corresponding_activity_log_entry_reviewed", "stop_location_sector_code", "stop_location_apartment", "stop_location_full_address", "stop_location_patrol_boro_name", "stop_location_street_name", "suspect_other_description", "observed_duration_minutes", "stop_duration_minutes", "summons_issued_flag", "supervising_officer_command_code", "issuing_officer_command_code", "stop_location_precinct", "year2", "suspect_arrest_offense"))
# check new dim
dim(sqf_data)
## [1] 16971 65
First, we note that only one column contains any NA values.
na_cols <- colMeans(is.na(sqf_data)) * 100
na_cols[na_cols > 0]
## demeanor_of_person_stopped
## 15.01385
The process generating the missingness of
demeanor_of_person_stopped is unclear.
sqf_data[1:5, "demeanor_of_person_stopped"]
## # A tibble: 5 × 1
## demeanor_of_person_stopped
## <chr>
## 1 YELLING AND KNOCKING ON THE DOOR
## 2 VILIGANT
## 3 UPSET
## 4 CALM
## 5 EVASIVE
Imputation of this free-text variable would be difficult, so we drop this column.
# drop
sqf_data <- sqf_data %>%
select(-("demeanor_of_person_stopped"))
# check new dim
dim(sqf_data)
## [1] 16971 64
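For reference, had we wanted to keep a categorical column with missing values, one simple alternative (not used in our pipeline) is mode imputation, replacing each NA with the most frequent observed category. A hypothetical sketch:

```r
# hypothetical mode imputation for a character column (not applied here)
x <- c("CALM", "CALM", "UPSET", NA, "EVASIVE")

# table() drops NAs by default, so the mode is taken over observed values
mode_val <- names(which.max(table(x)))
x[is.na(x)] <- mode_val
x  # NA replaced with "CALM"
```

For a free-text field like demeanor_of_person_stopped, however, the mode would be a poor summary of a highly varied column, which is why dropping it is the safer choice.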
Additionally, many observations take the value "(null)" across different columns.
# get % of nulls, in columns with at least one null
null_cols <- (colMeans(sqf_data == "(null)") * 100)[colMeans(sqf_data == "(null)") * 100 > 0]
# make df for plot
null_cols_df <- data.frame(Feature = names(null_cols), Percentage = null_cols)
dim(null_cols_df)
## [1] 47 2
# order for plot
null_cols_df$Feature <- factor(null_cols_df$Feature,
levels = null_cols_df$Feature[order(null_cols_df$Percentage, decreasing = FALSE)])
# plot
ggplot(null_cols_df, aes(x = Feature, y = Percentage)) +
geom_bar(stat = "identity", fill = "lightblue", color = "darkblue") +
labs(title = "Percentage of (null) Values per Column",
x = "Columns",
y = "Percentage of (null) Values") +
coord_flip() + # Flip coordinates
theme_minimal()
Note, however, that not all of these (null) entries are equivalent. For some columns, (null) genuinely means the value is missing (effectively NA), since instances of both "Y" and "N" appear alongside (null). For example:
sqf_data %>%
group_by(ask_for_consent_flg) %>%
summarise(N = n()) %>%
kable()
| ask_for_consent_flg | N |
|---|---|
| (null) | 779 |
| N | 14187 |
| Y | 2005 |
For other columns, (null) is effectively used to mean "N", as only "Y" and (null) appear:
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
sqf_data %>%
group_by(weapon_found_flag, firearm_flag) %>%
summarise(N = n()) %>%
kable()
## `summarise()` has grouped output by 'weapon_found_flag'. You can override using
## the `.groups` argument.
| weapon_found_flag | firearm_flag | N |
|---|---|---|
| N | (null) | 14310 |
| N | Y | 28 |
| Y | (null) | 1495 |
| Y | Y | 1138 |
# note that for the identifying variables related to cops, "yes" entries are indicated unusually
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "(null)" "I"
print(unique(sqf_data$verbal_identifies_officer_flag))
## [1] "(null)" "V"
print(unique(sqf_data$shield_identifies_officer_flag))
## [1] "(null)" "S"
Note here that even though firearm_flag can have a "Y" entry while weapon_found_flag has an "N" entry, this is not necessarily incorrect, as the officer can have identified a firearm without carrying out a frisk or a search.
We deal with these two cases of (null) separately. First, we identify the columns in which (null) stands in for "N":
# initialize empty vector
null_2 <- c()
# loop through columns
for (col in names(sqf_data)) {
# Get unique values of the column
unique_values <- unique(sqf_data[[col]])
# Check if unique values are exactly a subset of "Y", "I", "V", "S", and "(null)"
if (all(unique_values %in% c("Y", "I", "V", "S", "(null)")) && length(unique_values) == 2) {
null_2 <- c(null_2, col) # Add column name to null_2
}
}
# check n of type 2 nulls
length(null_2)
## [1] 30
# pre-clean check examples
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "(null)" "I"
# replace these nulls with Ns
sqf_data <- sqf_data %>%
mutate(across(all_of(null_2), ~ ifelse(. == "(null)", "N", .)))
# post-clean check examples
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "N" "I"
For the remaining columns, (null) represents genuinely missing values, so we recode it as NA:
# initialize empty vector
null_1 <- c()
# loop through columns
for (col in names(sqf_data)) {
# for columns not in null_2
if (!(col %in% null_2)) {
# if "(null)" is present in the column
if ("(null)" %in% sqf_data[[col]]) {
null_1 <- c(null_1, col) # add column name to the vector
}
}
}
# check length
length(null_1)
## [1] 17
# pre-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N" "Y" "(null)"
# replace these with NAs
sqf_data <- sqf_data %>%
mutate(across(all_of(null_1), ~ ifelse(. == "(null)", NA, .)))
# post-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N" "Y" NA
Now, we plot the percentage of genuinely missing values, correctly identified as NA:
# get % of NAs, in columns with at least one NA
na_cols <- (colMeans(is.na(sqf_data)) * 100)[colMeans(is.na(sqf_data)) * 100 > 0]
# make df for plot
na_cols_df <- data.frame(Feature = names(na_cols), Percentage = na_cols)
# order for plot
na_cols_df$Feature <- factor(na_cols_df$Feature,
levels = na_cols_df$Feature[order(na_cols_df$Percentage, decreasing = FALSE)])
# plot
ggplot(na_cols_df, aes(x = Feature, y = Percentage)) +
geom_bar(stat = "identity", fill = "#F8566D", color = "black") +
labs(title = "Percentage of NA Values per Column",
x = "Columns",
y = "Percentage of NA Values") +
coord_flip() + # Flip coordinates
theme_minimal()
Given the above, we drop columns in which more than 25% of values are missing, and then drop any remaining rows containing an NA:
sqf_data <- sqf_data %>%
select(-all_of(names(na_cols[na_cols > 25])))
dim(sqf_data)
## [1] 16971 59
sqf_data <- sqf_data %>%
filter(!if_any(everything(), is.na))
dim(sqf_data)
## [1] 12095 59
We recode the binary variables from character flags to numeric 0/1:
# pre check
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "N" "I"
# clean Ys and Ns and set as numeric
sqf_data <- sqf_data %>%
mutate(across(
where(~ all(. %in% c("Y", "N", "I", "V", "S"))),
~ as.numeric(ifelse(. %in% c("Y", "I", "V", "S"), 1, 0))
))
# post check
print(unique(sqf_data$firearm_flag))
## [1] 0 1
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] 0 1
We also bin the time of the stop from
stop_frisk_time:
sqf_data <- sqf_data %>%
mutate(
time_of_day = case_when(
str_extract(stop_frisk_time, "^\\d{2}") %in% c("00", "01", "02", "03", "04", "05") ~ "Late Night",
str_extract(stop_frisk_time, "^\\d{2}") %in% c("06", "07", "08", "09", "10", "11") ~ "Morning",
str_extract(stop_frisk_time, "^\\d{2}") %in% c("12", "13", "14", "15", "16", "17") ~ "Afternoon",
str_extract(stop_frisk_time, "^\\d{2}") %in% c("18", "19", "20", "21", "22", "23") ~ "Evening",
TRUE ~ NA_character_
),
time_of_day = factor(time_of_day, levels = c("Late Night", "Morning", "Afternoon", "Evening"))
)
# check
print(table(sqf_data$time_of_day))
##
## Late Night Morning Afternoon Evening
## 2807 909 2753 5626
# now drop stop frisk time as we will just use time_of_day
sqf_data <- sqf_data %>%
select(-"stop_frisk_time")
We also convert other relevant variables to factors as needed:
# convert character columns to factors, except for stop location x and y
sqf_data <- sqf_data %>%
mutate(across(
.cols = where(is.character) & !c("stop_location_x", "stop_location_y"),
.fns = as.factor
))
Next, we convert suspect_height, suspect_weight, and suspect_reported_age to numeric, keeping the factor levels in mind to ensure R converts them correctly. Then, we address outliers in the data.
# define convert factor to numeric, handling non-numeric entries
convert_to_numeric <- function(x) {
# convert to character to avoid factor levels issues
x <- as.character(x)
# strip non-numeric characters; entries with no digits (e.g. "unknown") become NA
x <- gsub("[^0-9.]", "", x)
# convert to numeric
as.numeric(x)
}
# apply the function to the relevant columns
sqf_data <- sqf_data %>%
mutate(
suspect_reported_age = convert_to_numeric(suspect_reported_age),
suspect_height = convert_to_numeric(suspect_height),
suspect_weight = convert_to_numeric(suspect_weight)
)
# check to make sure successful
summary(sqf_data$suspect_reported_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.00 25.00 27.77 34.00 118.00
summary(sqf_data$suspect_height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 5.400 5.700 5.649 5.900 7.600
summary(sqf_data$suspect_weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 150.0 160.0 166.3 180.0 760.0
Before addressing outliers in the three numerical features, we first need to transform the height variable as it is currently in the format of feet.inches. This creates an issue as an observation of 5.10 means 5 feet 10 inches, but 5.1 means 5 feet 1 inch. We will transform the inches into feet so each observation is only in feet.
Note that the raw data contain values such as 5.90, which means 5 feet 9 inches, not 5 feet 90 inches. To address this during the conversion, we check whether the inches part has one or two decimal digits and handle each case accordingly. We also verify that all extracted inch values fall between 0 and 11.9 to ensure none were converted incorrectly.
# Function to convert feet.inches to feet
convert_to_feet <- function(feet_inches) {
# Extract feet (integer part)
feet <- floor(feet_inches)
# Get the fractional part
fractional_part <- feet_inches - feet
# Interpret inches based on the fractional part
if (fractional_part == 0) {
# No fractional part means no additional inches
inches <- 0
} else {
# Convert fractional part to a string to check its length
fractional_str <- as.character(fractional_part)
if (grepl("\\.\\d$", fractional_str)) {
# Case like `.1`: Single digit after decimal -> 1, 2, ..., 9 inches
inches <- fractional_part * 10
} else {
# Two digits after the decimal (e.g. `.10` or `.11`) -> exact inches
inches <- fractional_part * 100
}
}
# Validate inches (should be between 0 and 11)
if (inches < 0 || inches > 11.9) {
warning(paste("Invalid height input:", feet_inches, "- Inches must be between 0 and 11.9. Returning NA."))
return(NA) # never triggered in our data, so all inches were extracted correctly
}
# Convert inches to feet
inches_in_feet <- inches / 12
# Return total height in feet
return(feet + inches_in_feet)
}
# Apply the conversion function to the 'suspect_height' column
sqf_data$suspect_height_feet <- sapply(sqf_data$suspect_height, convert_to_feet)
# Check the result
tail(sqf_data[, c("suspect_height", "suspect_height_feet")])
## # A tibble: 6 × 2
## suspect_height suspect_height_feet
## <dbl> <dbl>
## 1 5.7 5.58
## 2 5.11 5.92
## 3 5.9 5.75
## 4 5.8 5.67
## 5 6.2 6.17
## 6 5.8 5.67
sqf_data$suspect_height <- sqf_data$suspect_height_feet
# drop the intermediate height column
sqf_data <- sqf_data %>% dplyr::select(-suspect_height_feet)
We now plot the density of height in feet to see the distribution and identify outliers. We are going to drop observations below 4 ft and above 7 ft.
# compute density for suspect_height
height_density <- density(sqf_data$suspect_height, na.rm = TRUE)
# plot the density
plot(height_density,
main = "Density of Height with Outliers Highlighted",
xlab = "Height",
ylab = "Density",
col = "black",
lwd = 2)
grid()
# identify and highlight outliers (e.g., heights below 4 feet and above 7 feet)
outliers <- sqf_data$suspect_height[sqf_data$suspect_height < 4 | sqf_data$suspect_height > 7]
# add vertical lines to mark the outlier boundaries
abline(v = c(4, 7), col = "darkorchid2", lty = 2, lwd = 2)
# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkorchid2", pch = 19)
# drop outlier observations where height is above 7 ft and below 4 ft.
sqf_data <- sqf_data[sqf_data$suspect_height >= 4 & sqf_data$suspect_height <= 7, ]
We plot the density of age to see the distribution. Since there are some outliers, we drop any observations where age is below 10 or above 85. We also noticed that the suspect_reported_age data is quite skewed, so we apply a logarithmic transformation to this variable.
# compute density for reported_age
reported_age_density <- density(sqf_data$suspect_reported_age, na.rm = TRUE)
# plot the density
plot(reported_age_density,
main = "Density of Reported Age with Outliers Highlighted",
xlab = "Reported Age",
ylab = "Density",
col = "black",
lwd = 2)
grid()
# identify and highlight outliers (ages < 10 and > 85)
outliers <- sqf_data$suspect_reported_age[sqf_data$suspect_reported_age < 10 | sqf_data$suspect_reported_age > 85]
# add vertical lines to mark the outlier boundaries
abline(v = c(10, 85), col = "deeppink", lty = 2, lwd = 2)
# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "deeppink", pch = 19)
# drop outlier observations where age is above 85 and below 10.
sqf_data <- sqf_data[sqf_data$suspect_reported_age >= 10 & sqf_data$suspect_reported_age <= 85, ]
# Since the `suspect_reported_age` data is quite skewed, we perform a logarithmic transformation.
sqf_data$suspect_reported_age <- log(sqf_data$suspect_reported_age)
We plot the density of weight to identify outliers. The weight variable is measured in pounds and it appears that some observations could have data entry errors (for example: 9, 16, 760). We will drop outliers above 300 lbs and below 90 lbs.
# compute density for suspect_weight
weight_density <- density(sqf_data$suspect_weight, na.rm = TRUE)
# plot the density
plot(weight_density,
main = "Density of Suspect Weight with Outliers Highlighted",
xlab = "Weight",
ylab = "Density",
col = "black",
lwd = 2)
grid()
# identify and highlight outliers (e.g., weights below 90 lbs and above 300 lbs)
outliers <- sqf_data$suspect_weight[sqf_data$suspect_weight < 90 | sqf_data$suspect_weight > 300]
# add vertical lines to mark the outlier boundaries
abline(v = c(90, 300), col = "darkturquoise", lty = 2, lwd = 2)
# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkturquoise", pch = 19)
# drop outlier observations where weight is above 300 lbs. and below 90 lbs.
sqf_data <- sqf_data[sqf_data$suspect_weight >= 90 & sqf_data$suspect_weight <= 300, ]
Finally, we will standardize our numerical variables as they use different scales. This will allow us to better compare coefficients of these variables.
# standardize the numeric variables
sqf_data <- sqf_data %>%
mutate(
suspect_reported_age = scale(suspect_reported_age),
suspect_height = scale(suspect_height),
suspect_weight = scale(suspect_weight)
)
# check if the standardization worked by summarizing the variables
summary(sqf_data$suspect_reported_age)
## V1
## Min. :-2.3918
## 1st Qu.:-0.7609
## Median :-0.0636
## Mean : 0.0000
## 3rd Qu.: 0.7177
## Max. : 2.8599
summary(sqf_data$suspect_height)
## V1
## Min. :-5.25715
## 1st Qu.:-0.56747
## Median :-0.07382
## Mean : 0.00000
## 3rd Qu.: 0.41983
## Max. : 3.62856
summary(sqf_data$suspect_weight)
## V1
## Min. :-2.4893
## 1st Qu.:-0.5337
## Median :-0.2078
## Mean : 0.0000
## 3rd Qu.: 0.4440
## Max. : 4.3551
# check dim again
dim(sqf_data)
## [1] 12044 59
# tabulation of the dependent variable
sqf_data %>%
group_by(suspect_arrested_flag) %>%
summarise(N = n(),
Pc = N / nrow(sqf_data) * 100) %>%
arrange(desc(N)) %>%
kable(booktabs = TRUE, col.names = c("Suspect Arrested", "N Stops", "% Total Stops"), align = "l")
| Suspect Arrested | N Stops | % Total Stops |
|---|---|---|
| 0 | 8270 | 68.6649 |
| 1 | 3774 | 31.3351 |
# looking at the distribution of sex
ggplot(sqf_data, aes(x = suspect_sex, fill = suspect_sex)) +
geom_bar() +
labs(
title = "Distribution of Suspect Sex",
x = "Sex",
y = "N Stops"
) +
theme_minimal() +
scale_fill_manual(
values = c("MALE" = "lightblue", "FEMALE" = "pink")
) +
theme(legend.position = "none")
# sex by arrest status
ggplot(sqf_data, aes(x = suspect_sex, fill = factor(suspect_arrested_flag))) +
geom_bar(position = "fill") +
labs(title = "Distribution of Arrest Status, By Suspect Sex",
x = "Sex",
y = "% Stops") +
theme_minimal() +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(type = "qual", palette = "Pastel2", name = "Suspect Arrested")
# empirical cdf of age by sex and arrest status
ggplot(sqf_data, aes(x = suspect_reported_age, color = factor(suspect_arrested_flag))) +
stat_ecdf(geom = "step") +
facet_wrap(~ suspect_sex, ncol = 2) +
scale_color_manual(values = c("0" = "red", "1" = "darkgreen"),
labels = c("Not Arrested", "Arrested"),
name = "Arrest Outcome") +
labs(x = "Suspect Reported Age", y = "ECDF", title = "Empirical CDF of Suspect Reported Age, By Sex and Arrest Status") +
theme_minimal()
# arrests by race, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
geom_bar() +
coord_flip() +
theme_minimal() +
xlab("Suspect Race") +
ylab("N Stops") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Distribution of Arrest Status, By Suspect Race")
# arrests by race, stacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
geom_bar(position = "fill") +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
xlab("Suspect Race") +
ylab("% Stops") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Distribution of Arrest Status, By Suspect Race")
# arrests by suspected crime description, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspected_crime_description)), fill = factor(suspect_arrested_flag))) +
geom_bar() +
coord_flip() +
theme_minimal() +
xlab("Suspected Crime") +
ylab("N Stops") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Distribution of Arrest Status, By Suspected Crime")
# arrests by suspected crime description, stacked
ggplot(data = sqf_data, aes(x = suspected_crime_description, fill = factor(suspect_arrested_flag))) +
geom_bar(position = "fill") +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
xlab("Suspected Crime") +
ylab("% Stops") +
scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
labs(title = "Distribution of Arrest Status, By Suspect Crime")
# heatmap of arrests by time of day and borough
ggplot(
sqf_data %>%
group_by(stop_location_boro_name, time_of_day) %>%
summarise(
percent_arrested = mean(suspect_arrested_flag, na.rm = TRUE) * 100,
.groups = "drop"
),
aes(
x = time_of_day,
y = stop_location_boro_name,
fill = percent_arrested
)
) +
geom_tile(color = "white") +
scale_fill_viridis_c(name = "Arrest Rate (%)") +
labs(
title = "Heat Map of Arrest Rates",
x = "Time of Day",
y = "Borough"
) +
theme_minimal()
# heatmap of arrests by issuing officer rank and suspect race
ggplot(
sqf_data %>%
group_by(issuing_officer_rank, suspect_race_description) %>%
summarise(
percent_arrested = mean(suspect_arrested_flag, na.rm = TRUE) * 100,
.groups = "drop"
),
aes(
x = issuing_officer_rank,
y = suspect_race_description,
fill = percent_arrested
)
) +
geom_tile(color = "white") +
scale_fill_viridis_c(name = "Arrest Rate (%)") +
labs(
title = "Heat Map of Arrest Rates",
x = "Issuing Officer Rank",
y = "Suspect Race"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
We first clean the data for spatial mapping using the sf
and nycgeo packages.
We use this to obtain information about the stops at the census-tract level, chosen for its granularity and the availability of population statistics at this level. Tract-level characteristics may matter for prediction because the demographics of the area where a stop occurs can proxy for local conditions, such as policing intensity, that plausibly relate to arrest outcomes.
# drop 7 observations which have incorrect spatial info
sqf_data <- sqf_data %>%
filter(stop_location_x > 0)
dim(sqf_data)
## [1] 12037 59
# make spatial object for mapping
sqf_data_sf <- st_as_sf(sqf_data,
coords = c("stop_location_x", "stop_location_y"),
crs = 2263) # crs for New York (EPSG:2263)
# load in nta-level shapefile
remotes::install_github("mfherman/nycgeo")
## Skipping install of 'nycgeo' from a github remote, the SHA1 (4fee55c1) has not changed since last install.
## Use `force = TRUE` to force installation
library(nycgeo)
nyc_tract_shp <- nycgeo::nyc_boundaries(geography = "tract", add_acs_data = TRUE)
## old-style crs object detected; please recreate object with a recent sf::st_crs()
# check crs
st_crs(nyc_tract_shp)$epsg
## [1] 2263
# plot data onto shapefile by arrest status
ggplot() +
geom_sf(data = nyc_tract_shp, fill = "lightblue", color = "black", size = 0.3) +
geom_sf(data = sqf_data_sf, aes(color = as.factor(suspect_arrested_flag)), size = 0.7) +
scale_color_manual(values = c("red", "seagreen"),
labels = c("Not Arrested", "Arrested")) +
theme_minimal() +
labs(title = "NYC Police Stops by Arrest Status") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.title = element_blank())
Perhaps it is more informative to view the percentage of stops ending in arrest at the tract level:
# join datasets to assign each stop to a tract
sqf_data_sf <- st_join(sqf_data_sf, nyc_tract_shp)
dim(sqf_data_sf)
## [1] 12037 104
# aggregate to tract level
sqf_data_sf_tract_level <- sqf_data_sf %>%
filter(!is.na(geoid)) %>%
group_by(geoid) %>%
summarize(pc_arrest = (sum(suspect_arrested_flag) / n()) * 100)
# join with shp for mapping
sqf_data_sf_tract_level <- nyc_tract_shp %>%
st_join(sqf_data_sf_tract_level) # st_join matches on geometry; a `by` key is not used
ggplot() +
geom_sf(data = sqf_data_sf_tract_level, aes(fill = pc_arrest), color = "black", size = 0.3) +
scale_fill_viridis_c(
name = "% Stops Ending in Arrest",
option = "inferno",
na.value = "white"
) +
theme_void() +
labs(title = "Percentage of Stops Ending in Arrest by NYC Census Tract")
Additionally, the nycgeo package allows us to link with
census tract level American Community Survey (2013-2017) population
statistics.
These demographic statistics for the location of a stop may help predict arrest: as the NYCLU analysis discussed above suggests, stop practices and outcomes vary with neighborhood composition, so local demographics may proxy for differences in policing across areas.
We visualize those here:
# we first create the new variables in the shapefile for ease of plotting
nyc_tract_shp <- nyc_tract_shp %>%
mutate(pc_pop_black_est = (pop_black_est / pop_total_est) * 100,
pc_pop_hisp_est = (pop_hisp_est / pop_total_est) * 100,
pc_pop_asian_est = (pop_asian_est / pop_total_est) * 100,
pc_pop_ba_above_est = (pop_ba_above_est / pop_total_est) * 100,
pc_pop_inpov_est = (pop_inpov_est / pop_total_est) * 100
)
# non hispanic black
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pc_pop_black_est)) +
scale_fill_viridis_c(
name = "Non-Hispanic Black % Total Population",
option = "inferno"
) +
theme_void() +
labs(title = "Non-Hispanic Black Population by Census Tract, ACS 2013-2017")
# hispanic any
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pc_pop_hisp_est)) +
scale_fill_viridis_c(
name = "Hispanic Any Race % Total Population",
option = "inferno"
) +
theme_void() +
labs(title = "Hispanic Any Race Population by Census Tract, ACS 2013-2017")
# non-hispanic asian
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pc_pop_asian_est)) +
scale_fill_viridis_c(
name = "Non-hispanic Asian % Total Population",
option = "inferno"
) +
theme_void() +
labs(title = "Non-hispanic Asian Population by Census Tract, ACS 2013-2017")
# pop age 25 years or older with at least bachelors degree
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pc_pop_ba_above_est)) +
scale_fill_viridis_c(
name = "% Population Aged >= 25 with at Least Bachelors Degree",
option = "inferno"
) +
theme_void() +
labs(title = "% Population Aged >= 25 with at least a Bachelor's Degree by Census Tract, ACS 2013-2017")
# population with income below the poverty line
ggplot(nyc_tract_shp) +
geom_sf(aes(fill = pc_pop_inpov_est)) +
scale_fill_viridis_c(
name = "% Population With Income Below Poverty Line",
option = "inferno"
) +
theme_void() +
labs(title = "% Population with Income Below Poverty Line, ACS 2013-2017")
We keep only these ACS variables as predictors in our analysis, as shown below.
# now we create the new variables in the sqf_data_sf object, before merging as appropriate by stop_id into the original data
sqf_data_sf <- sqf_data_sf %>%
mutate(pc_pop_black_est = (pop_black_est / pop_total_est) * 100,
pc_pop_hisp_est = (pop_hisp_est / pop_total_est) * 100,
pc_pop_asian_est = (pop_asian_est / pop_total_est) * 100,
pc_pop_ba_above_est = (pop_ba_above_est / pop_total_est) * 100,
pc_pop_inpov_est = (pop_inpov_est / pop_total_est) * 100
)
# check current dim
dim(sqf_data)
## [1] 12037 59
sqf_data <- sqf_data %>%
# left join selected spatial features from the sf object into sqf_data
left_join(sqf_data_sf %>% select(stop_id, pc_pop_ba_above_est, pc_pop_inpov_est, pc_pop_asian_est, pc_pop_hisp_est, pc_pop_black_est), by = "stop_id") %>%
# drop x,y coords and geometry as we use census tract for spatial info
select(-c("stop_location_x", "stop_location_y", "geometry")) %>%
# drop obs with missing values in these spatial features
filter(!if_any(everything(), is.na))
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# check new dim
dim(sqf_data)
## [1] 11957 62
For our logistic and penalized regressions, we split our sample in two: a training set (70% of observations) used to fit the models, and a test set (the remaining 30%) used to evaluate out-of-sample performance. We perform this split below:
# set seed for reproducibility
set.seed(1)
# set y and X matrix
y <- sqf_data$suspect_arrested_flag
X <- model.matrix(~ . - suspect_arrested_flag - stop_id - search_basis_incidental_to_arrest_flag, data = sqf_data)
# perform train-test split
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
y_train <- y[train_index]
y_test <- y[-train_index]
X_train <- X[train_index,]
X_test <- X[-train_index, ]
# print lengths and dimensions
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 8370
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 3587
cat("Dimensions of X_train:", dim(X_train), "\n")
## Dimensions of X_train: 8370 157
cat("Dimensions of X_test:", dim(X_test), "\n")
## Dimensions of X_test: 3587 157
# check balance of y
print(table(y_train))
## y_train
## 0 1
## 5752 2618
print(table(y_test))
## y_test
## 0 1
## 2465 1122
Note that, for each model type, we estimate using two sets of predictor matrices to test the sensitivity of prediction accuracy. The first (X_train and X_test) simply drops stop_id, due to its irrelevance for prediction, and search_basis_incidental_to_arrest_flag, as this indicates a search carried out after an arrest and so cannot be known in advance. The second set (X_train_subset and X_test_subset) also drops all further variables related to search, frisk and physical force. We drop these as a conservative approach to prediction: we are ultimately interested in predicting arrest status ex-ante, from the police officer's perspective at the start of the stop, and it is unclear whether these events occur before or after the arrest.
We create this subset below:
# set x subset, removing anything that is/might be realized during the stop
X_subset <- X[, !grepl("^(search|physical_force|.*eye_color|.*hair_color)", colnames(X)) &
!colnames(X) %in% c("frisked_flag", "firearm_flag", "knife_cutter_flag",
"other_weapon_flag", "weapon_found_flag", "other_contraband_flag")]
# perform train-test split
X_train_subset <- X_subset[train_index, ]
X_test_subset <- X_subset[-train_index, ]
cat("Dimensions of X_train_subset:", dim(X_train_subset), "\n")
## Dimensions of X_train_subset: 8370 114
cat("Dimensions of X_test_subset:", dim(X_test_subset), "\n")
## Dimensions of X_test_subset: 3587 114
Next, note that we also require a validation dataset for tuning the hyperparameters of our elastic net and random forest models. For this reason, we split X_train_subset further into X_train_subset_final and X_val_subset, and correspondingly split y_train into y_train_final and y_val. We only do this for the subset of regressors, as we only tune these models on the subset for computational reasons.2
We create these matrices here:
# subset training data for validation split
n_train <- nrow(X_train)
id_split <- sample(1:n_train, floor(0.5 * n_train))
# split into training and validation sets
X_train_subset_final <- X_train_subset[id_split, ]
y_train_final <- y_train[id_split]
X_val_subset <- X_train_subset[-id_split, ]
y_val <- y_train[-id_split]
# print dimensions
cat("dimensions of X_train_final:", dim(X_train_subset_final), "\n")
## dimensions of X_train_final: 4185 114
cat("dimensions of y_train_final:", length(y_train_final), "\n")
## dimensions of y_train_final: 4185
cat("dimensions of X_val:", dim(X_val_subset), "\n")
## dimensions of X_val: 4185 114
cat("dimensions of y_val:", length(y_val), "\n")
## dimensions of y_val: 4185
We now proceed by running our regression models, evaluating each of them based on their predictive performance in the test data.
Note that, as we are working in a binary classification setting, we are looking to minimize misclassification loss. The cost ratio in this framework is \[C = \frac{L(1,0)}{L(0,1)},\] which measures how costly false negatives are relative to false positives. The optimal decision rule is \[\hat{y}_i = 1 \iff p_i > \frac{1}{1+C}.\]
When constructing our confusion matrices, we choose \(C=1\), weighting both types of misclassification equally, and so classify according to the most likely class, i.e. \[p_i > 0.5 \implies \hat{y}_i = 1.\]
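To make the decision rule concrete, here is a minimal base-R sketch, using hypothetical probabilities rather than our fitted values, of how the cutoff \(1/(1+C)\) moves as the cost ratio changes:

```r
# hypothetical fitted probabilities, for illustration only
p <- c(0.10, 0.35, 0.55, 0.80)

# decision rule: predict 1 iff p > 1 / (1 + C)
classify <- function(p, C) as.integer(p > 1 / (1 + C))

classify(p, C = 1)  # threshold 0.50, as used in our confusion matrices
classify(p, C = 3)  # false negatives 3x as costly; threshold drops to 0.25
```

With \(C=1\) only the two probabilities above 0.5 are classified as arrests; raising \(C\) lowers the cutoff, flagging more stops as likely arrests at the expense of more false positives.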
Our logistic regression model serves as a simple baseline. As it is a simple model, we use X_train_subset, the more conservative predictor set that drops the variables which may be realized before or after arrest.
Since logistic regression is sensitive to multicollinearity, we also drop the supervising_officer_rank factor variables, as they are closely related to issuing_officer_rank: issuing officers of a given rank are likely to have supervisors of the same rank.
# Set X_logit, remove supervising_officer_rank to avoid rank-deficient fit
X_logit <- X_subset[, !grepl("^(supervising_officer_rank)", colnames(X_subset))]
Now we set the training and testing split using the X_logit data and run our model.
# perform train-test split on X_logit
X_train_logit <- X_logit[train_index, ]
X_test_logit <- X_logit[-train_index, ]
# print lengths and dimensions
cat("Dimensions of X_train_logit:", dim(X_train_logit), "\n")
## Dimensions of X_train_logit: 8370 101
cat("Dimensions of X_test_subset:", dim(X_test_logit), "\n")
## Dimensions of X_test_subset: 3587 101
# glm requires dataframes as arguments
X_train_logit_df <- as.data.frame(X_train_logit)
X_test_logit_df <- as.data.frame(X_test_logit)
# full logit model in training data
logit <- glm(y_train ~ ., data = X_train_logit_df, family = binomial(logit))
# get fitted probabilities using trained model on test data
logit_predict <- as.numeric(predict(logit, newdata = X_test_logit_df, type = "response"))
# convert probabilities to class predictions using a threshold of 0.5
logit_y_hat <- ifelse(logit_predict > 0.5, 1, 0)
# inspect class balance of predicted classes
table(logit_y_hat)
## logit_y_hat
## 0 1
## 2828 759
# define a function to generate and plot confusion matrix
generate_cm <- function(true, predicted, title) {
# gen confusion matrix as a data frame
cm <- as.data.frame(table(True = true, Predicted = predicted)) %>%
group_by(Predicted) %>%
mutate(Predicted_pct = Freq / sum(Freq))
# print cm
print(cm)
# plot cm
plot <- ggplot(data = cm, mapping = aes(x = ordered(True, c(1, 0)), y = Predicted, fill = Predicted_pct)) +
geom_tile() +
geom_text(aes(label = round(Predicted_pct, 2)), color = 'white') +
scale_fill_gradient(low = "blue", high = "red", name = "Rel. Freq.") +
xlab("True") +
ylab("Predicted") +
labs(title = title) +
theme_minimal()
# print plot
print(plot)
}
# plot confusion matrix for logistic regression with all variables
generate_cm(y_test, logit_y_hat, "Confusion Matrix for Logistic Regression")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2192 0.775
## 2 1 0 636 0.225
## 3 0 1 273 0.360
## 4 1 1 486 0.640
# compute ROC and AUC
logit_roc <- roc(response = y_test, predictor = logit_predict, quiet = TRUE)
logit_auc <- round(auc(logit_roc), 4)
cat("AUC for the logit model", logit_auc, "\n")
## AUC for the logit model 0.7751
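The AUC has a useful interpretation: it is the probability that a randomly chosen arrest receives a higher predicted probability than a randomly chosen non-arrest. A minimal base-R sketch on toy scores (not our fitted model) illustrates this rank-based, Mann-Whitney form of the statistic:

```r
# toy predicted scores and true labels, for illustration only
score <- c(0.9, 0.8, 0.4, 0.3, 0.2)
label <- c(1, 1, 0, 1, 0)

# AUC as P(score of a positive > score of a negative), ties counted as 1/2
auc_rank <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

auc_rank(score, label)  # 5 of the 6 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, which is why it is a convenient threshold-free complement to the confusion matrices.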
Next we implement the LASSO. Since we estimate with family = "binomial", the objective is the penalized negative binomial log-likelihood rather than the sum of squared errors: \[ \beta^{LASSO} = \arg\min_{\beta} \; -\sum_{i=1}^n \Big[ y_i x_i' \beta - \log\big(1 + e^{x_i' \beta}\big) \Big]\] subject to \[ \sum_{j=1}^{p} |\beta_j| \le t \]
We implement cross-validation in our training sample to choose the penalty parameter without using the test data. We use the default \(k=10\) folds for computational efficiency.
As we are operating in a classification setting, we choose the value of the penalty parameter lambda that minimises the cross-validated misclassification error, implemented by setting type.measure = "class" in our cv.glmnet regressions.
# run lasso on training data to collect coefficients
lasso <- cv.glmnet(x=X_train, y=y_train, alpha = 1, family="binomial", type.measure = "class")
# optimal lambda
lasso_lambda_min <- lasso$lambda.min
cat("Optimal Lambda for Lasso:", lasso_lambda_min, "\n")
## Optimal Lambda for Lasso: 0.0009253621
# plot misclassification error against log lambda
plot(lasso)
abline(v = log(lasso_lambda_min), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (LASSO - Full)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)", ylab = "Misclassification Error")
# plot coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO (Full Model)",
xlab = "Log(Lambda)", ylab = "Coefficients")
# get fitted probabilities, using best lambda
lasso_predict <- as.numeric(predict(lasso, s = lasso_lambda_min, X_test, type = "response"))
# add variable for predicted classes
lasso_y_hat <- ifelse(lasso_predict > 0.5, 1, 0)
# plot confusion matrix for lasso regression out of sample
generate_cm(y_test, lasso_y_hat, "Confusion Matrix for LASSO (Full Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2348 0.910
## 2 1 0 231 0.0896
## 3 0 1 117 0.116
## 4 1 1 891 0.884
# compute ROC and AUC
lasso_roc_full <- roc(response = y_test, predictor = lasso_predict, quiet = TRUE)
lasso_auc_full <- round(auc(lasso_roc_full), 4)
cat("AUC for the LASSO (full) model", lasso_auc_full, "\n")
## AUC for the LASSO (full) model 0.9464
# run lasso on training data to collect coefficients
lasso_subset <- cv.glmnet(x=X_train_subset, y=y_train, alpha = 1, family="binomial", type.measure = "class")
lasso_lambda_min_subset <- lasso_subset$lambda.min
cat("Optimal Lambda for Lasso (Subset):", lasso_lambda_min_subset, "\n")
## Optimal Lambda for Lasso (Subset): 0.007069799
# plot misclassification error against log lambda
plot(lasso_subset)
abline(v = log(lasso_lambda_min_subset), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (LASSO - Subset)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)", ylab = "Misclassification Error")
# plot coefficients
plot(lasso_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO (Subset)",
xlab = "Log(Lambda)", ylab = "Coefficients")
# get fitted probabilities
lasso_predict_subset <- as.numeric(predict(lasso_subset, s = lasso_lambda_min_subset, X_test_subset, type = "response"))
# add variable for predicted classes
lasso_y_hat_subset <- ifelse(lasso_predict_subset > 0.5, 1, 0)
# plot confusion matrix for lasso subset regression out of sample
generate_cm(y_test, lasso_y_hat_subset, "Confusion Matrix for LASSO (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2294 0.753
## 2 1 0 751 0.247
## 3 0 1 171 0.315
## 4 1 1 371 0.685
# compute roc object for subset model
lasso_roc_subset <- roc(response = y_test, predictor = lasso_predict_subset, quiet = TRUE)
lasso_auc_subset <- round(auc(lasso_roc_subset), 4)
cat("AUC for the LASSO (subset) model", lasso_auc_subset, "\n")
## AUC for the LASSO (subset) model 0.7754
Next, we implement ridge regression, which replaces the \(\ell_1\) constraint with an \(\ell_2\) constraint: \[ \beta^{Ridge} = \arg\min_{\beta} \; -\sum_{i=1}^n \Big[ y_i x_i' \beta - \log\big(1 + e^{x_i' \beta}\big) \Big]\] subject to \[ \sum_{j=1}^{p} \beta_j^2 \le t \].
# run ridge regression on training data to collect coefficients
ridge <- cv.glmnet(x = X_train, y = y_train, alpha = 0, family = "binomial", type.measure = "class")
ridge_lambda_min <- ridge$lambda.min
cat("optimal lambda for ridge:", ridge_lambda_min, "\n")
## optimal lambda for ridge: 0.02457843
# plot misclassification error against log lambda
plot(ridge)
abline(v = log(ridge_lambda_min), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (Ridge - Full)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)", ylab = "Misclassification Error")
# plot coefficients
plot(ridge$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Full Model)",
xlab = "Log(Lambda)", ylab = "Coefficients")
# get fitted probabilities
ridge_predict <- as.numeric(predict(ridge, s = ridge_lambda_min, X_test, type = "response"))
# add variable for predicted classes
ridge_y_hat <- ifelse(ridge_predict > 0.5, 1, 0)
# plot confusion matrix for ridge regression out of sample
generate_cm(y_test, ridge_y_hat, "Confusion Matrix for Ridge (Full Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2357 0.903
## 2 1 0 253 0.0969
## 3 0 1 108 0.111
## 4 1 1 869 0.889
# compute roc object for full ridge model
ridge_roc_full <- roc(response = y_test, predictor = ridge_predict, quiet = TRUE)
ridge_auc_full <- round(auc(ridge_roc_full), 4)
cat("AUC for the Ridge (full) model:", ridge_auc_full, "\n")
## AUC for the Ridge (full) model: 0.9464
# run ridge regression on subset predictors
ridge_subset <- cv.glmnet(x = X_train_subset, y = y_train, alpha = 0, family = "binomial", type.measure = "class")
ridge_lambda_min_subset <- ridge_subset$lambda.min
cat("Optimal lambda for ridge (subset):", ridge_lambda_min_subset, "\n")
## Optimal lambda for ridge (subset): 0.1558983
# plot misclassification error against log lambda
plot(ridge_subset)
abline(v = log(ridge_lambda_min_subset), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation Misclassification Error (Ridge - Subset)",
sub = "Optimal Lambda Highlighted in Red",
xlab = "Log(Lambda)", ylab = "Misclassification Error")
# plot coefficients
plot(ridge_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Subset)",
xlab = "Log(Lambda)", ylab = "Coefficients")
# get fitted probabilities
ridge_predict_subset <- as.numeric(predict(ridge_subset, s = ridge_lambda_min_subset, X_test_subset, type = "response"))
# add variable for predicted classes
ridge_y_hat_subset <- ifelse(ridge_predict_subset > 0.5, 1, 0)
# plot confusion matrix for ridge subset regression out of sample
generate_cm(y_test, ridge_y_hat_subset, "Confusion Matrix for Ridge (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2333 0.743
## 2 1 0 807 0.257
## 3 0 1 132 0.295
## 4 1 1 315 0.705
# compute roc object for subset ridge model
ridge_roc_subset <- roc(response = y_test, predictor = ridge_predict_subset, quiet = TRUE)
ridge_auc_subset <- round(auc(ridge_roc_subset), 4)
cat("AUC for the Ridge (subset) model:", ridge_auc_subset, "\n")
## AUC for the Ridge (subset) model: 0.7681
Now we implement elastic net regression, tuning both \(\lambda\) and \(\alpha\). Tuning \(\alpha\) lets the data choose the mix between the LASSO and ridge penalties, which elastic net nests as special cases (\(\alpha = 1\) and \(\alpha = 0\) respectively). For this we use our validation data, as described above.
We tune the elastic net only on the subset of regressors: tuning on the full set as well would require a further data split, and we restrict tuning to the subset for computational reasons, consistent with our approach for the random forest.
# Define alpha values for grid search
alpha_values <- seq(0, 1, by = 0.1)
# Initialize storage for results
results <- data.frame(alpha = alpha_values, auc = NA, optimal_lambda = NA)
# loop grid search
for (i in 1:nrow(results)) {
alpha_i <- results$alpha[i]
# train en model with cross-validation for the given alpha
en_opt <- cv.glmnet(
x = X_train_subset_final,
y = y_train_final,
alpha = alpha_i,
family = "binomial",
type.measure = "class"
)
# get the best lambda for the current alpha
en_opt_lambda_min <- en_opt$lambda.min
# get fitted probabilities on the validation set using the best lambda and current alpha
en_opt_predict_val <- as.numeric(predict(en_opt, s = en_opt_lambda_min, newx = X_val_subset, type = "response"))
# compute roc and aucs of trained model in validation data
en_opt_roc_val <- roc(response = as.numeric(y_val), predictor = en_opt_predict_val, quiet = TRUE)
en_opt_auc_val <- auc(en_opt_roc_val)
# add to storage vector
results$auc[i] <- en_opt_auc_val
results$optimal_lambda[i] <- en_opt_lambda_min
}
# identify the best alpha
best_params <- results[which.max(results$auc), ]
cat("Optimal alpha:", best_params$alpha, "\n")
## Optimal alpha: 0.2
cat("Optimal lambda:", best_params$optimal_lambda, "\n")
## Optimal lambda: 0.0355349
cat("Best AUC:", best_params$auc, "\n")
## Best AUC: 0.7316938
# train the tuned elastic net model with best hyperparameters
en_opt <- glmnet(
x = X_train_subset_final,
y = y_train_final,
alpha = best_params$alpha,
lambda = best_params$optimal_lambda,
family = "binomial"
)
# get fitted probabilities using trained model on test data
en_opt_predict <- as.numeric(predict(en_opt, s = best_params$optimal_lambda, newx = X_test_subset, type = "response"))
# add variable for predicted classes
en_opt_y_hat <- ifelse(en_opt_predict > 0.5, 1, 0)
# plot confusion matrix for tuned elastic net regression out of sample
generate_cm(y_test, en_opt_y_hat, "Confusion Matrix for Elastic Net (Subset Model)")
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2297 0.747
## 2 1 0 777 0.253
## 3 0 1 168 0.327
## 4 1 1 345 0.673
# compute test ROC and AUC out of sample
en_opt_roc_subset <- roc(response = as.numeric(y_test), predictor = en_opt_predict, quiet = TRUE)
en_opt_auc_subset <- auc(en_opt_roc_subset)
cat("Test AUC for Tuned Elastic Net model:", en_opt_auc_subset, "\n")
## Test AUC for Tuned Elastic Net model: 0.7708992
Next, we implement the random forest algorithm, which can capture non-linear relationships and interactions among predictors that our linear models cannot. We again use the subset of predictors, both to maintain the conservative ex-ante framing and because tuning on the full set is computationally expensive.
We tune the maximum number of features considered at each split (mtry) alongside the number of trees, aiming to simplify the model. We restrict the grid to \(mtry \in \{1, \dots, 20\}\) and \(ntree \in \{50, 100, 150\}\), mainly for computational reasons.
# Initialize parameter grid for grid search
ntrees <- seq(50, 150, 50) # range of trees to tune over
max_features_range <- 1:20 # Range of mtry (max features) to tune over
# initialize storage for accuracy metrics
results <- expand.grid(ntree = ntrees, mtry = max_features_range)
dim(results)
## [1] 60 2
results$train_acc <- NA
results$test_acc <- NA
results$oob_acc <- NA
# tune
for (i in 1:nrow(results)) {
ntrees_i <- results$ntree[i]
mtry_i <- results$mtry[i]
# train rf model using training data and current iteration of mtry and ntree
rf <- randomForest(
x = X_train_subset_final,
y = factor(y_train_final),
mtry = mtry_i,
ntree = ntrees_i
)
# use trained model to predict y in validation data
y_pred_val <- predict(rf, X_val_subset)
# get trained model predictions of y in training data
y_pred_train <- rf$predicted
# compute out of bag accuracy measure
results$oob_acc[i] <- 1 - rf$err.rate[ntrees_i, "OOB"]
# compute training accuracy measure
results$train_acc[i] <- mean(y_train_final == as.numeric(levels(y_pred_train)[y_pred_train]))
# compute "test" - validation - accuracy
results$test_acc[i] <- mean(y_val == as.numeric(levels(y_pred_val)[y_pred_val]))
}
# identify the optimal parameters
best_params <- results[which.max(results$test_acc), ]
cat("Optimal ntree:", best_params$ntree, "\n")
## Optimal ntree: 100
cat("Optimal mtry:", best_params$mtry, "\n")
## Optimal mtry: 13
# define accuracy metrics and y-axis range for tuning plot
accuracy_metrics <- results[results$ntree == best_params$ntree, ]
y_range <- range(c(accuracy_metrics$train_acc, accuracy_metrics$test_acc, accuracy_metrics$oob_acc)) + c(-0.01, 0.01)
# plot train, validation, and OOB accuracy for optimal ntree as mtry varies
plot(
accuracy_metrics$mtry, accuracy_metrics$train_acc,
type = 'l', col = 'blue', ylim = y_range,
main = 'Tuning mtry for Optimal ntree',
xlab = 'Number of Features (mtry)', ylab = 'Accuracy'
)
lines(accuracy_metrics$mtry, accuracy_metrics$test_acc, col = 'red')
lines(accuracy_metrics$mtry, accuracy_metrics$oob_acc, col = 'green', lty = 2)
legend('bottomright',
legend = c('Train Accuracy', 'Validation Accuracy', 'OOB Accuracy'),
col = c('blue', 'red', 'green'), lty = c(1, 1, 2), bty = 'n')
# given optimal parameters, train using training data to get tuned model
RFTuned <- randomForest(
x = X_train_subset_final,
y = factor(y_train_final),
mtry = best_params$mtry,
ntree = best_params$ntree
)
# using model trained with optimal parameters, predict with TEST data
RFPred <- predict(RFTuned, newdata = X_test_subset)
# compute associated relevant fitted probabilities
RFProb <- predict(RFTuned, newdata = X_test_subset, type = "prob")
# compute test roc and auc of tuned model
RF_roc <- roc(response = factor(y_test), predictor = RFProb[, 2], levels = c(0, 1), quiet = TRUE)
RF_auc <- round(auc(RF_roc), 4)
cat("Random Forest AUC (Subset):", RF_auc, "\n")
## Random Forest AUC (Subset): 0.7919
# gen confusion matrix
generate_cm(
true = y_test,
predicted = RFPred,
title = "Confusion Matrix for Random Forest (Subset)"
)
## # A tibble: 4 × 4
## # Groups: Predicted [2]
## True Predicted Freq Predicted_pct
## <fct> <fct> <int> <dbl>
## 1 0 0 2271 0.763
## 2 1 0 706 0.237
## 3 0 1 194 0.318
## 4 1 1 416 0.682
# function to plot ROC curves for multiple models
plot_multiple_roc <- function(roc_list, labels, colors, lwd_list, lty_list, main_title) {
# Plot the first model as the base plot
plot(
x = 1 - roc_list[[1]]$specificities,
y = roc_list[[1]]$sensitivities,
type = "l",
col = colors[1],
lwd = lwd_list[1],
lty = lty_list[1],
xlim = c(0, 1),
ylim = c(0, 1),
main = main_title,
xlab = "False Positive Rate",
ylab = "True Positive Rate",
cex.lab = 1.5,
cex.main = 1.8,
cex.axis = 1.2
)
# Add the remaining ROC curves
for (i in 2:length(roc_list)) {
lines(
x = 1 - roc_list[[i]]$specificities,
y = roc_list[[i]]$sensitivities,
col = colors[i],
lwd = lwd_list[i],
lty = lty_list[i]
)
}
# Add legend
legend(
"bottomright",
legend = labels,
col = colors,
lwd = lwd_list,
lty = lty_list,
bty = "n"
)
}
# set list of input rcs and associated labels, colors etc for arguments
roc_list <- list(logit_roc, lasso_roc_full, ridge_roc_full, RF_roc, ridge_roc_subset, lasso_roc_subset, en_opt_roc_subset)
labels <- c(
sprintf("LASSO Full (AUC: %.4f)", lasso_auc_full),
sprintf("Ridge Full (AUC: %.4f)", ridge_auc_full),
sprintf("Logistic Regression (AUC: %.4f)", logit_auc),
sprintf("Random Forest Subset (AUC: %.4f)", RF_auc),
sprintf("Ridge Subset (AUC: %.4f)", ridge_auc_subset),
sprintf("LASSO Subset (AUC: %.4f)", lasso_auc_subset),
sprintf("Elastic Net Subset (AUC: %.4f)", en_opt_auc_subset)
)
colors <- c("red", "darkgreen", "blue", "darkorange", "lightgreen", "pink", "purple")
lwd_list <- c(3, 3, 2, 2, 2, 2, 2)
lty_list <- c(2, 3, 1, 1, 3, 2, 1)
#call
plot_multiple_roc(
roc_list = roc_list,
labels = labels,
colors = colors,
lwd_list = lwd_list,
lty_list = lty_list,
main_title = "ROC Curves for All Models"
)
# make and print sorted table of AUCs
auc_table <- data.frame(
Model = c(
"Logistic Regression",
"LASSO Full",
"LASSO Subset",
"Ridge Full",
"Ridge Subset",
"Elastic Net Subset",
"Random Forest Subset"
),
AUC = c(
logit_auc,
lasso_auc_full,
lasso_auc_subset,
ridge_auc_full,
ridge_auc_subset,
en_opt_auc_subset,
RF_auc
)
)
auc_table <- auc_table %>%
arrange(desc(AUC))
print(auc_table)
## Model AUC
## 1 LASSO Full 0.9464000
## 2 Ridge Full 0.9464000
## 3 Random Forest Subset 0.7919000
## 4 LASSO Subset 0.7754000
## 5 Logistic Regression 0.7751000
## 6 Elastic Net Subset 0.7708992
## 7 Ridge Subset 0.7681000
Comparing out-of-sample performance, the full LASSO and ridge models dominate (AUC 0.9464 each), but this is largely because they retain the search, frisk and force variables, which are realized during or after the stop and therefore leak information about the outcome. Among the conservative ex-ante models, the random forest performs best (AUC 0.7919), followed by the LASSO subset (0.7754), the logistic baseline (0.7751), the tuned elastic net (0.7709) and the ridge subset (0.7681), all within a narrow band. The modest gain of the random forest over the linear models suggests some non-linearity or interactions among the pre-stop predictors, though arrest remains substantially harder to predict from ex-ante information alone.
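AUC is threshold-free; at the \(C=1\) cutoff, the confusion matrices above also imply simple point metrics. As a sanity check, a base-R sketch using the logistic regression counts reported earlier (2192 true negatives, 636 false negatives, 273 false positives, 486 true positives):

```r
# confusion-matrix counts for the logistic model, read off the table above
tn <- 2192; fn <- 636; fp <- 273; tp <- 486

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of test stops classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual arrests correctly flagged
specificity <- tn / (tn + fp)                   # share of non-arrests correctly flagged

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The logistic baseline classifies roughly three quarters of test stops correctly but recovers well under half of the actual arrests, a reminder that the threshold choice, not only the model, shapes the error trade-off.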
# Function to compute and plot variable importance
compute_variable_importance <- function(coefficients, model_name, top_n = 20) {
importance_df <- data.frame(
Variable = names(coefficients),
Scaled_Importance = abs(coefficients) / max(abs(coefficients), na.rm = TRUE)
) %>%
filter(Scaled_Importance > 0) %>%
arrange(desc(Scaled_Importance)) %>%
slice(1:top_n)
ggplot(importance_df, aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
coord_flip() +
labs(
title = paste("Top", top_n, "Var Imp for", model_name),
x = "Variable",
y = "Scaled Importance"
) +
theme_minimal()
}
# 1. Logistic Regression
logit_coeff <- coef(logit)
logit_coeff <- logit_coeff[-1]
logit_plot <- compute_variable_importance(logit_coeff, "Logistic Regression")
print(logit_plot)
# 2. LASSO Subset
lasso_subset_coeff <- as.numeric(coef(lasso_subset, s = lasso_lambda_min_subset))
names(lasso_subset_coeff) <- rownames(coef(lasso_subset, s = lasso_lambda_min_subset))
lasso_subset_coeff <- lasso_subset_coeff[-1]
lasso_plot <- compute_variable_importance(lasso_subset_coeff, "LASSO Subset")
print(lasso_plot)
# 3. Ridge Subset
ridge_subset_coeff <- as.numeric(coef(ridge_subset, s = ridge_lambda_min_subset))
names(ridge_subset_coeff) <- rownames(coef(ridge_subset, s = ridge_lambda_min_subset))
ridge_subset_coeff <- ridge_subset_coeff[-1]
ridge_plot <- compute_variable_importance(ridge_subset_coeff, "Ridge Subset")
print(ridge_plot)
# Extract Random Forest importance
rf_importance_subset <- data.frame(
Variable = rownames(RFTuned$importance),
Scaled_Importance = RFTuned$importance[, "MeanDecreaseGini"] / max(RFTuned$importance[, "MeanDecreaseGini"], na.rm = TRUE)
)
rf_importance_subset <- rf_importance_subset %>%
arrange(desc(Scaled_Importance))
# plot rf variable importance - top 20
ggplot(rf_importance_subset %>% arrange(desc(Scaled_Importance)) %>% slice(1:20),
aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
coord_flip() +
labs(title = "Top 20 Var Imp for RF",
x = "Variable",
y = "Scaled Importance") +
theme_minimal()
The variable importance plots above indicate which predictors drive each model's out-of-sample predictions; importance rankings that are broadly stable across the logistic, penalized, and random forest models point to a robust set of ex-ante correlates of arrest.
In summary, arrests can be predicted moderately well (AUC around 0.77 to 0.79) from information plausibly available before the stop concludes. To the extent that demographic characteristics of civilians, or the racial composition of the surrounding census tract, rank highly among these predictors, this would echo the NYCLU's criticism that the program's burden falls disproportionately on communities of color, and warrants careful scrutiny.
Our analysis has limitations. The data are observational and our models are predictive, not causal; the timing of search, frisk and force relative to arrest is ambiguous, so even our conservative subset may retain some leakage; and we fix the cost ratio at \(C=1\), whereas policy applications may weight false positives and false negatives differently. Possible extensions include tuning the elastic net and random forest on the full predictor set, calibrating the decision threshold to an explicit cost ratio, and testing the stability of our results across years and administrations.